Augmenting Focused Crawling Using Search Engine Queries
Authors
Abstract
The pervasiveness of the Internet makes it an ideal medium for sharing scholarly information. Many authors now post their publications online so that others can easily access them, increasing the authors' impact in their research areas. In this project, we develop a focused crawler to find publication pages: web pages that link to freely available online scholarly publications. In contrast to previous work, which only traverses hyperlinks within web pages, our algorithm leverages search engine queries to locate suitable pages for crawling. This strategy allows our crawler to locate more relevant pages and lessens its reliance on the quality of the seed pages used to start the crawling process. Our crawler is also able to locate relevant pages that are not accessible to standard crawlers, which work only by following hyperlinks. Our results show that our system avoids a slow start and finds publication pages faster, outperforming local crawling methods.
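The idea of mixing search engine results into a hyperlink-driven frontier can be illustrated with a small sketch. This is not the authors' implementation: the toy in-memory web `WEB`, the keyword scorer `relevance`, the stand-in `stub_search` function, and the priorities 1.0 and 0.5 are all assumptions made for illustration, and no real search engine API is called.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple

# Toy in-memory "web": URL -> (page text, outgoing links). Illustrative data only.
WEB: Dict[str, Tuple[str, List[str]]] = {
    "seed": ("Department home page", ["about"]),
    "about": ("About the department", []),
    "alice": ("Alice's publications: papers in PDF", []),
    "bob": ("Bob - selected papers (pdf)", []),
}


def relevance(text: str) -> float:
    """Hypothetical scorer: fraction of cue words found in the page text."""
    cues = ("publications", "pdf", "papers")
    lowered = text.lower()
    return sum(cue in lowered for cue in cues) / len(cues)


def stub_search(query: str) -> List[str]:
    """Stand-in for a search engine: returns pages the seed never links to."""
    return ["alice", "bob"]


def crawl(web: Dict[str, Tuple[str, List[str]]], seeds: List[str],
          search: Callable[[str], List[str]], queries: List[str],
          budget: int) -> List[str]:
    """Best-first crawl whose frontier is fed by hyperlinks and search results."""
    frontier: List[Tuple[float, str]] = []  # (negated priority, url)
    seen: Set[str] = set()

    def push(url: str, priority: float) -> None:
        # Skip unknown URLs and anything already queued or visited.
        if url in web and url not in seen:
            seen.add(url)
            heapq.heappush(frontier, (-priority, url))

    for url in seeds:
        push(url, 0.5)  # assumed neutral priority for seed pages
    for query in queries:
        for url in search(query):
            push(url, 1.0)  # assumed high priority for search results

    found: List[str] = []
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)
        budget -= 1
        text, links = web[url]
        score = relevance(text)
        if score >= 0.5:
            found.append(url)
        for link in links:
            push(link, score)  # a link inherits the score of its parent page
    return found


print(crawl(WEB, ["seed"], stub_search, ["publications"], budget=3))
print(crawl(WEB, ["seed"], lambda q: [], [], budget=3))
```

In this toy setup the seed page links only to an irrelevant page, so a crawl that follows hyperlinks alone finds nothing within the budget, while the search-augmented crawl reaches the two publication-like pages immediately. That mirrors, in miniature, the reduced dependence on seed quality described in the abstract.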
Similar resources
Augmenting Focused Crawling using Search Engine Queries
Focused Crawling using Asynchronous Cellular Learning Automata
Web crawling is used to collect the web pages which will be indexed by a search engine. The search engine uses these crawled and indexed pages to answer users’ queries. Since the volume of web pages is very high and it increases continuously, search engines can index a limited number of web pages. Therefore, in recent years, the focused crawler algorithms have been introduced which act selectiv...
Collecte orientée sur le Web pour la recherche d'information spécialisée. (Focused document gathering on the Web for domain-specific information retrieval)
Vertical search engines, which focus on a specific segment of the Web, are becoming more and more present in the Internet landscape. Topical search engines, notably, can obtain a significant performance boost by limiting their index to a specific topic. By doing so, language ambiguities are reduced, and both the algorithm...
A New Approach Towards Vertical Search Engines - Intelligent Focused Crawling and Multilingual Semantic Techniques
Search engines typically consist of a crawler which traverses the web retrieving documents and a search frontend which provides the user interface to the acquired information. Focused crawlers refine the crawler by intelligently directing it to predefined topic areas. The evolution of search engines today is expedited by supplying more search capabilities such as a search for metadata as well a...
DHT-Based Distributed Crawler
A search engine, like Google, is built using two pieces of infrastructure: a crawler that indexes the web and a searcher that uses the index to answer user queries. While Google's crawler has worked well, there is the issue of timeliness and the lack of control given to end-users to direct the crawl according to their interests. The interface presented by such search engines is hence very limite...